


Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Neural Information Processing Systems

Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if Q-functions are well-specified, but if they are not they can be worse than both IS and SNIS. Nor do they enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. Along the way, we categorize various properties and classify existing estimators by them. Beyond the theoretical guarantees, empirical studies demonstrate the advantages of the new estimators.
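To make the baseline estimators concrete, here is a minimal sketch (not the paper's proposed estimator) of IS, SNIS, and DR on a synthetic one-context bandit. The logging policy, target policy, true mean rewards, and the deliberately misspecified Q-model `q_hat` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextual bandit with a single context and two actions.
n = 10_000
behavior = np.array([0.7, 0.3])   # logging policy pi_b(a)
target = np.array([0.2, 0.8])     # evaluation policy pi_e(a)
true_q = np.array([1.0, 2.0])     # true mean reward per action
true_value = target @ true_q      # value of pi_e: 1.8

# Log data under the behavior policy.
a = rng.choice(2, size=n, p=behavior)
r = true_q[a] + rng.normal(0.0, 0.5, size=n)

# Importance weights pi_e(a) / pi_b(a).
w = target[a] / behavior[a]

# IS: unbiased, but unbounded and high-variance when weights are large.
is_est = np.mean(w * r)

# SNIS: normalizes the weights, so the estimate always lies within the
# range of observed rewards (the stability/boundedness property).
snis_est = np.sum(w * r) / np.sum(w)

# DR: a (possibly misspecified) Q-model plus an IS correction term;
# consistent even though q_hat here is wrong.
q_hat = np.array([0.8, 2.3])      # assumed, misspecified Q-function
dr_est = np.mean(target @ q_hat + w * (r - q_hat[a]))
```

With a fixed seed all three estimates land close to the true policy value of 1.8; the differences between them show up in variance and in worst-case behavior, which is what the paper's property taxonomy formalizes.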


Reviews: Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Neural Information Processing Systems

This is a key study in the OPE literature, as methods that provide better stability for off-policy methods are required for practical applications of RL. Table 1 is useful: it provides a good summary and comparison of existing OPE estimators. Section 2.1 further summarizes existing OPE estimators in terms of consistency, stability, and boundedness. This is well written and easy to follow, and useful for the community, as it provides a direct comparison between existing OPE estimators across several properties.


Reviews: Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Neural Information Processing Systems

This paper presents new estimators for off-policy evaluation (OPE) based on empirical likelihood and argues that the new estimators are better than importance sampling (IS). The paper provides strong theoretical guarantees for the estimators and demonstrates their effectiveness through simple experiments. The reviewers agree that the paper is well written overall and that the proposed methods are technically sound and likely to be built upon by the community. One reviewer is unsure whether the proposed methods will be practical in RL applications, as the experiments are performed on very simple tasks.


Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Kallus, Nathan, Uehara, Masatoshi

Neural Information Processing Systems
